Abstract

In this article we provide a formal framework for reidentification in
general. We define n-confusion as a concept for modelling the anonymity
of a database table and we prove that n-confusion is a generalization of
k-anonymity. After a short survey on the different available definitions of
k-anonymity for graphs we provide a new definition for k-anonymous graph,
which we consider to be the correct definition. We provide a description of
the k-anonymous graphs, both for the regular and the non-regular case.
We also introduce the more flexible concept of (k, l)-anonymous graph.
Our definition of (k, l)-anonymous graph is meant to replace a previous
definition of (k, l)-anonymous graph, which we here prove to have severe
weaknesses. Finally we provide a set of algorithms for k-anonymization
of graphs.



1 Introduction 

In data privacy, the assessment of risk is one of the elements of major impor- 
tance. At present, several approaches have been studied in the literature. The 
major approaches are k-anonymity [T^J [13 HOI IS] , reidentification [52] and 
differential privacy [7].

In this paper, we focus on two of them: reidentification and k-anonymity.
The former evaluates the disclosure risk of a protected data set by measuring the
chances that an intruder can link his information with that in the protected
data set. In contrast, k-anonymity tries to avoid any reidentification, producing
a protected data set where each record is cloaked into a set of other k - 1 records.

In this work we formalize the reidentification process and we use this
formalization to discuss the concept of k-anonymity and then propose the concept of
n-confusion. We also prove that n-confusion is a generalization of k-anonymity.

Then, we discuss the application of the concept of k-anonymity to graphs.
At present, due to the interest of online social networks, several authors have
studied data protection for graphs. It is relevant here to consider the works
in [5J HJ Q31 [33] , in which alternative definitions of k-anonymity for graphs have
been presented. In this work we discuss these different definitions and we show
that the definition in [5] has severe weaknesses. Then, we first provide an
alternative definition for k-anonymity that provides enough security at the cost
of being quite restrictive, and later the definition of (k, l)-anonymity, a relaxation
of the former one.

The paper discusses several properties of the definitions. In particular, we
study the characterization of the k-anonymous graphs. We also provide
algorithms for transforming a graph into a k-anonymized one, for calculating the
degree of (k, l)-anonymity of a graph given k, and to increase the l of the
(k, l)-anonymity of a graph.

The structure of the paper is as follows. Section 2 discusses disclosure risk on
online social networks focusing on reidentification and k-anonymity. Section 3
reviews previous approaches of k-anonymity for graphs, and presents an attack
for the approach introduced in [8]. Section 4 introduces a new definition for
k-anonymity and studies some properties of this definition. Section 5 presents
a relaxation, (k, l)-anonymity. Section 6 includes the algorithms that we have
developed related to (k, l)-anonymity. The paper finishes with the conclusions.

2 Disclosure risk evaluation 

In this section we first present a formal framework for evaluating disclosure risk 
in data privacy in general (see also pj5]). Then we will focus on disclosure risk 
for online social networks. 

2.1 Reidentification in privacy protected databases 

A database is a collection of records of data. In this article we will suppose 
that all records correspond to distinct individuals or objects. Every record 
has a unique identifier and is divided into attributes. The attributes can be 
very specific, as the attributes "height" or "gender", or more general, as the 
attributes "text" or "sequence of binary numbers" . 

Suppose that the database can be represented as a single table. Let the 
records be the rows of the table and let the attributes be the columns. The 
intersection of a row and an attribute is a cell in the table, and we call the data 
in the cells the entries of the database.

Let T be a table with n records and m attributes. We define the partition
set V(T) of T to be the set of subsets of the underlying set of entries of T,

$$V(T) = \{\, s : s \subseteq \{ T[i][j] : 1 \le i \le n,\ 1 \le j \le m \} \,\}.$$

Definition 1. A method for anonymization of databases is any transformation 
or operator 

$$p : D \to D, \qquad X \mapsto Y,$$

where D is a space of databases. 

Then p, given a database X, returns a database Y. Since Y is a database,
all entries in Y will correspond to a unique individual or object, which we will
suppose to be the same individuals as the ones behind the records in X. Usually
it is assumed that there is, in some sense, less sensitive information about the
individuals behind the records in X in the transformed database Y than there
was in the original database X.

Definition 2. Let p be a method for anonymization of databases, X a table with
n records indexed by I in the space of tables D and Y = p(X) the anonymization
of X using p. Then a re-identification method is a function that, given a
collection of entries y in V(Y) and some additional information from a space
of auxiliary informations A, returns the probability that y corresponds to entries
from the record with index i ∈ I,

$$r : V(Y) \times A \to [0,1]^n, \qquad (y, a) \mapsto \bigl( P(y \text{ corresponds to entries from } X[i]) : i \in I \bigr).$$

Consider the objective probability distribution corresponding to the re-identification 
problem. Then, we require from a re-identification method that it returns a prob- 
ability distribution that is compatible with this probability, also after missing 
some relevant information. Compatibility can be modeled in terms of compatibility
of belief functions (see [16]).

In this definition, the probability r(y,a) could have been expressed as con- 
ditioned by a. However, in this article we prefer to use the notation above. 

Compatibility implies that the more evidence we have, the less entropy the
probability has. Because of that, the method returns (1/n, ..., 1/n) if it has
no evidence of the original record corresponding to the protected record y, and,
as the evidence increases, the probabilities of the corresponding indices are
augmented. An example of this situation is when the variables of a data set are
protected independently by means of k-anonymity. Then, re-identification can
be applied to protected data using only some of the attributes. The distributions
computed by these methods should be compatible with the ones considered
when all attributes are taken into account. When no attribute is considered,
the algorithm should lead to the probability (1/n, ..., 1/n).

In this way we avoid re-identification methods with false positives, since
these would disturb the rest of the discussion. Also, probabilities are assumed
to be defined so that the same value will apply to different protected records
whenever these have the same value.

A common assumption is to consider that re-identification occurs when the
probability function that is returned by the re-identification method takes the
value 1 at one index, say at i_0, and the value 0 at all the other indices. That
is, given the auxiliary information a there is probability 1 that y belongs to the
record with index i_0 in X.

We say that the entries s ∈ V(Y) are linked to a collection of indices J ⊆ I if
the probabilities that are returned by the re-identification method take non-zero
values over the indices J and are zero on the complement I \ J. In a nice and
regular situation, a possible non-zero value for the re-identification method over
J is then 1/|J|.

Here, auxiliary information denotes any information used to achieve a better 
performance of the re-identification process. It is common that researchers 
use parts of the original database X as such auxiliary information. One can 
however argue that for example knowledge of the method of anonymization is 
also auxiliary information. When the database covers only part of a population, 
and it is not known who is in the database, then information about individuals 
that are not in the original database can serve as auxiliary information. In 
general, we do not assume that the auxiliary information can be indexed by 
individuals or that it has any particular structure at all. 

Definition 3. We define the confusion of a method of re-identification r, with
respect to the anonymized database Y = p(X), the auxiliary information A and
the threshold t ∈ [0, 1] as

$$\mathrm{conf}(r, Y, A, t) = \inf_{y \in Y} M_{r,t}(y, A),$$

where

$$M_{r,t}(y, A) = \begin{cases} \inf_{a \in A} |\{ i \in I : r(y,a)[i] > t \}| & \text{if } \inf_{a \in A} |\{ i \in I : r(y,a)[i] > t \}| > 0, \\ |I| & \text{if } \inf_{a \in A} |\{ i \in I : r(y,a)[i] > t \}| = 0. \end{cases}$$

The space of auxiliary information contains any information that could be 
useful and accessible to the adversary. Examples of auxiliary information are
information in the public domain and information on the method that was used
in order to anonymize the data. A bad determination of A could imply that the 
confusion of a re-identification method is overestimated. 

We have assumed that we know who is in the database and who is not. It 
can be proved that, under this assumption, any additional information that is 
useful for a re-identification method can be deduced from the original database 
X. Hence, in this particular case, we may assume that A = X.

Definition 4. Given a space of databases D, a space of auxiliary information
A and a method of anonymization of databases p, we say that p provides
(n, t)-confusion if for all re-identification methods r, and all anonymized databases
p(X) ∈ D, the confusion of r with respect to p(X) and A is greater than or equal to n
for the fixed threshold 0 < t ≤ 1/n.

The (n, t)-confusion therefore measures the smallest cardinality of a set of
individuals for which the re-identification methods give probability higher than
t, calculated for all protected registers y ∈ p(X).

(n, t)-Confusion is more reliable as a measure of anonymity when n and t
are both large, simultaneously. That is, an anonymization method providing
(n', t')-confusion is at least as good as one that provides (n, t)-confusion when
t' > t and n' > n. This statement is based on the following observations.

On the one hand, if n is small, then an adversary might be able to form a
collusion of size n - 1 which could break the anonymity of the nth register. This
issue is analogous to what we get if we implement k-anonymity with a small k.

On the other hand, it is interesting to observe what may occur if the
threshold t is much smaller than 1/n. For example, say that we apply all available
re-identification methods to the protected record y. Suppose that the best
result gives us r(y, a)[i_0] = 0.9 and that there are n - 1 other indices i for which
r(y, a)[i] = 0.1/(n - 1). Then we have (n, t)-confusion with t = 0.05/(n - 1).
To avoid this issue, an adequate value for t could be approximately t = 1/n.
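
The effect of the threshold in this example can be checked numerically. The following Python sketch is only an illustration under simplifying assumptions: the space A is taken to be finite, each value r(y, a) is stored as a plain list of probabilities, and the function names `count_above`, `m_value` and `confusion` are hypothetical.

```python
def count_above(probs, t):
    """Number of indices i with probs[i] > t."""
    return sum(1 for p in probs if p > t)


def m_value(dists_for_y, t):
    """M_{r,t}(y, A) from Definition 3 for one protected record y.

    dists_for_y holds one probability vector r(y, a) per auxiliary
    information a; A is assumed to be finite in this sketch.
    """
    n_indices = len(dists_for_y[0])
    smallest = min(count_above(probs, t) for probs in dists_for_y)
    # If no index exceeds the threshold, the definition returns |I|.
    return smallest if smallest > 0 else n_indices


def confusion(dists, t):
    """conf(r, Y, A, t): minimum of M_{r,t}(y, A) over the records y."""
    return min(m_value(dists_for_y, t) for dists_for_y in dists)


n = 5
# One protected record and one auxiliary information: probability 0.9 on
# one index and 0.1/(n - 1) on each of the other n - 1 indices.
skewed = [[[0.9] + [0.1 / (n - 1)] * (n - 1)]]

print(confusion(skewed, 0.05 / (n - 1)))  # 5: every index exceeds the threshold
print(confusion(skewed, 1 / n))           # 1: only the index with 0.9 remains
```

With the small threshold the record appears to be hidden among all n indices, although one index carries almost all of the probability; with t = 1/n the computed confusion drops to 1.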

Next we present a proposal for a definition of n-confusion. 

Definition 5. Let notations be as in Definition 4. We say that the anonymization
method p provides n-confusion if there is a t > 0 such that p provides
(n, t)-confusion.

This definition of n-confusion is designed so that n will be the measure of 
the smallest number of indices in 1, . . . ,n for which the result is a non-zero 
probability, when applying a re-identification method to a protected record in 
p(X). In this way, anonymization methods for which re-identification returns 
very distinct probabilities also provide n-confusion, whenever at least n of these 
probabilities are non-zero for every protected record. This mimics k-anonymity
in the sense that in order to re-identify an individual with absolute certainty, a
collusion of the k - 1 other target individuals is necessary.

Other possible approaches are to consider that a table satisfies n-confusion

• when the highest value of the best re-identification method is taken at
least n times, for every protected record y and all auxiliary information;

• when the highest value of the best re-identification method is at most 1/n.

2.2 An approach for disclosure risk control: k-anonymity 

The concept of k-anonymity [15, 17, 20, 21] encompasses a set of techniques
for data protection that try to avoid reidentification risk. When protecting a
database, we have k-anonymity when a record is cloaked into a set of other k - 1
records. Thus, given the record of the intruder, reidentification always returns
k indistinguishable records.

We use the notation T(A) to say that T is a table with the set of attributes
A. Let B ⊆ A be a set of attributes of the table. We denote the projection of
the table on the attributes B by T[B]. We suppose that every record contains
information about a unique individual. An identifier I in a database is an
attribute such that it uniquely identifies the individuals behind the records. In
particular, any entry in T[I] is unique. A quasi-identifier QI in the database is a
collection of attributes {A_1, ..., A_n} that belong to the public domain (i.e. are
known to an adversary), such that they in combination can uniquely, or almost
uniquely, identify a record. That is, the structure of the table allows for
the possibility that an entry in T[QI] is unique, or that there are only a small
number of equal entries. In the former case the entry in T[QI] uniquely identifies
the individual behind the record and in the latter, the few other individuals with
the same entries in T[QI] may form a collusion and use secret information about
themselves in order to make this identification possible.

The former case may be formalized as follows. Consider the table T' obtained
by randomly permuting the records of T. Let s be an element in V(T') such
that the entries of s all belong to the same record in T' (and therefore also in
T). Then, if there is a method of reidentification r : T' × A → [0, 1]^n such that
r(s, a)[i] = 1 for some a ∈ A and one index i, then s belongs to T[QI].

In the latter case, an s such that r(s, a) is large for a small subset J of indices
and 0 for the others (so that s is linked to J) would also belong to T[QI].

Example 6. If a table contains information on students in a school class, the
attributes birth date and gender could be sufficient to determine to which
individual a record of the table corresponds, although it is possible that not all
records will be uniquely identified in this way. Hence for this table, birth date
and gender are an example of a quasi-identifier.

The following definition of k-anonymity appeared for the first time in [17]
(see also the articles by Samarati [15] and Sweeney [21]).

Definition 7. A table T, that represents a database and has associated
quasi-identifier QI, is k-anonymous if every sequence in T[QI] appears with at least
k occurrences in T[QI].
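
Definition 7 translates directly into a counting test. The sketch below is a minimal illustration in Python; the function name `is_k_anonymous`, the representation of a table as a list of dictionaries, and the toy data are assumptions made for this example only.

```python
from collections import Counter


def is_k_anonymous(table, quasi_identifier, k):
    """True if every sequence in T[QI] occurs at least k times.

    table: list of dicts mapping attribute names to values.
    quasi_identifier: list of the attribute names that form QI.
    """
    # Project every record onto the quasi-identifier attributes.
    projection = [tuple(row[a] for a in quasi_identifier) for row in table]
    counts = Counter(projection)
    return all(c >= k for c in counts.values())


# Toy data; attribute names and values are invented for illustration.
students = [
    {"birth_year": 2001, "gender": "F", "grade": 7},
    {"birth_year": 2001, "gender": "F", "grade": 9},
    {"birth_year": 2002, "gender": "M", "grade": 8},
    {"birth_year": 2002, "gender": "M", "grade": 6},
]
qi = ["birth_year", "gender"]

print(is_k_anonymous(students, qi, 2))  # True: each QI value occurs twice
print(is_k_anonymous(students, qi, 3))  # False
```

Only the attributes in the quasi-identifier are compared; the remaining attribute (here the hypothetical `grade`) may differ within a class of indistinguishable records.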

We have the following relation between k-anonymity and n-confusion.

Theorem 8. Consider a database X and a space of auxiliary information A. 
Suppose that our knowledge on A has permitted us to determine correctly the 
quasi-identifier QI of X . Let p be a method of anonymization of databases that 
gives k-anonymity with respect to QI. Then there is a threshold 0 < t < 1 such
that any method of reidentification r is of confusion at least k with respect to 
p(X), r, and t, so that p provides k-confusion. 

Proof. Let X be a table with records indexed by I and Y = p(X) a table that
is k-anonymous with respect to QI, obtained by applying the anonymization
method p to X. The assumption that our knowledge on A has permitted us
to determine correctly the quasi-identifier QI implies that we may express
the auxiliary information as the restriction of the original table to the
quasi-identifier, A = X[QI]. Then any reidentification method r : Y × A → [0, 1]^n
will take values r(y, a) in which at least k entries r(y, a)[i] are equal to x with
0 < x ≤ 1/k and the other entries are smaller than x. Define t as the minimum
among all these x. Then the confusion of r is at least k, for the threshold
0 < t ≤ 1/k. Since X was any table and r any method of reidentification, p
provides k-confusion. □

In some cases the threshold for a k-anonymous table will be 1/k.

Theorem 8 shows that any table that satisfies k-anonymity also satisfies
n-confusion with n := k. The converse is not true, that is, a table that satisfies
n-confusion does not necessarily satisfy n-anonymity. Next we will see an example
of this.

Example 9. Let X be a numerical table with 30 distinct records (points) in R^3.
Suppose that we want to anonymize X according to k-anonymity with k = 3. A
common approach for achieving k-anonymity is to apply a clustering algorithm
to the table, see for example JSj/. A clustering algorithm returns a partition of
the records so that the records in each class of the partition are in some sense
similar. In this case the clustering algorithm returns a partition of the record
set in 10 classes with exactly 3 records in each class, so that these records are
points inside a ball of radius r from the average of the three points. Given the
records (points) p_1, p_2, p_3 ∈ R^3, let A({p_1, p_2, p_3}) represent their average, and
V({p_1, p_2, p_3}) represent the normalized vector that is perpendicular to the plane
defined by the points p_1, p_2, p_3. Let c(p) represent the points in the cluster of the
point p. Two alternatives are considered for the definition of a 3-anonymization
of X.

1. Replace p ∈ X by A(c(p));

2. Let p, p', p'' be the three points in a cluster c. Replace p by A(c), p' by
A(c) + eV(c) and p'' by A(c) - eV(c), where e is a positive real number
smaller than the radius r of the ball that contains the points in the cluster
c.

Then, the first alternative satisfies both 3-anonymity and (3, 1/3)-confusion.
However, the second alternative does not satisfy 3-anonymity, but it does satisfy
(3, 1/3)-confusion.
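
The two alternatives of Example 9 can be sketched for a single cluster as follows. This Python fragment is an illustration only: the function names and the sample cluster are hypothetical, the three points are assumed not to be collinear, and the code shows the published records, not a verification of the confusion claim.

```python
def average(points):
    """A(c): coordinate-wise average of a list of points in R^3."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def unit_normal(p1, p2, p3):
    """V(c): normalized vector perpendicular to the plane through p1, p2, p3."""
    u = tuple(b - a for a, b in zip(p1, p2))
    v = tuple(b - a for a, b in zip(p1, p3))
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    length = sum(c * c for c in cross) ** 0.5  # zero only for collinear points
    return tuple(c / length for c in cross)


def alternative_1(cluster):
    """Replace every point of the cluster by the cluster average."""
    centre = average(cluster)
    return [centre for _ in cluster]


def alternative_2(cluster, e):
    """Replace the three points by A(c), A(c) + eV(c) and A(c) - eV(c)."""
    centre = average(cluster)
    normal = unit_normal(*cluster)
    plus = tuple(c + e * n for c, n in zip(centre, normal))
    minus = tuple(c - e * n for c, n in zip(centre, normal))
    return [centre, plus, minus]


cluster = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(alternative_1(cluster))       # three identical records
print(alternative_2(cluster, 0.5))  # three distinct records
```

The first output consists of three equal records, so the cluster is 3-anonymous in the sense of Definition 7. The second output contains three different records, which is why 3-anonymity fails for the second alternative.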

2.3 k-Anonymity for graphs 

A graph is a pair (V, E), where V is a set of vertices and E is a family of 2-subsets
of V called edges. Sometimes additional information is associated with
a vertex or an edge. Such information is called a label. Graphs can be
used to represent, for example, a social network. In the common approach for
representing a social network as a graph, individuals are represented as vertices,
information about relations between individuals is represented as edges and
other information about the individuals or about the relations is represented as
labels.

The concept of k-anonymity was initially defined for tables. In order to apply
k-anonymity to other data structures, observe that these can be represented in
table form. For example, when applying k-anonymity to graphs, the adjacency
matrix of the graph is a representation of the graph in table form. The adjacency 
matrix of a graph is a matrix in which both the rows and the columns are 
indexed by the vertices of the graph and the entries represent the number of 
edges between the corresponding vertices. A table is obtained by taking the rows 
and the columns of the matrix to be the records and the attributes of the table, 
respectively. Then every vertex of the graph occurs as an index of a record in 
the table and the attributes indicate the existence of edges to the other vertices. 
Depending on the situation, what is considered to be interesting information 
about the graph may vary. Therefore, other attributes may be included in 
the table, like for example the degree of the vertices. The information given 
by the adjacency matrix is however enough to deduce any other information 
available about the graph. There are also other representations of graphs that 
contain the same information as the table just described. One example is the 
incidence matrix, in which the rows are indexed by the vertices, the columns 
are indexed by the edges and the entries indicate if the corresponding vertex is
on the corresponding edge.
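
As a minimal illustration of the table form described above, the following Python sketch builds the adjacency matrix of a small undirected graph and tests whether every row occurs at least k times. The function names are hypothetical, and comparing rows literally is only one possible reading of k-anonymity for this table; it is not meant as the definition of k-anonymous graph developed later in the article.

```python
from collections import Counter


def adjacency_matrix(vertices, edges):
    """Adjacency matrix of an undirected graph, as a list of rows.

    Rows and columns follow the order of `vertices`; each entry counts
    the edges between the two corresponding vertices.
    """
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for u, v in edges:
        matrix[index[u]][index[v]] += 1
        matrix[index[v]][index[u]] += 1
    return matrix


def rows_k_anonymous(matrix, k):
    """True if every row (record) of the matrix occurs at least k times."""
    counts = Counter(tuple(row) for row in matrix)
    return all(c >= k for c in counts.values())


cycle = adjacency_matrix(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
)
path = adjacency_matrix(["a", "b", "c"], [("a", "b"), ("b", "c")])

print(rows_k_anonymous(cycle, 2))  # True: rows of a and c agree, as do b and d
print(rows_k_anonymous(path, 2))   # False: the row of the middle vertex is unique
```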

As we saw in the previous section, the concept of k-anonymity is based on
the existence of a quasi-identifier. A quasi-identifier is a collection of attributes 
of the table, and the quality of the anonymization of the table depends on the 
correct determination of the quasi-identifier. Usually it is the data owner who
executes the anonymization method and who is therefore
responsible for the correct determination of the quasi-identifier. In this process,
the data owner must consider a table form representation of the graph. The
choice of attributes for this table is of course crucial for the determination of the
quasi-identifier. The data owner cannot know in advance which information
may be useful for the adversary. For example, the adversary could use the edge 
set of the vertices for the reidentification process, or he could use only the degree 
of the vertices. In the former example the adversary uses exactly the information 
given by the adjacency matrix, and in the latter example he uses information 
that can be derived from the adjacency matrix, and that can be attached to the
table as an additional attribute. A prudent data owner should assume that the
table that represents the graph, and which serves for the determination of the
quasi-identifier, contains all attributes that may be relevant for a reidentification
attempt by an adversary. We observe that this will occur exactly when the table
that is defined for the adjacency matrix (or some other equivalent table) is
k-anonymized! This is so, because all other information about the graph can be
deduced from this table.

2.4 Table data k-anonymity versus graph k-anonymity



Previous authors on this subject seem to agree on the opinion that k-anonymization
for graphs differs from k-anonymization of tables. This opinion complicates the
application of the concept of k-anonymity to graphs. We argue that graph
k-anonymity is a special case of k-anonymity as defined by Sweeney.

Below we list the arguments used in [23] to justify the difference between
k-anonymization of table form data and graph data.

1. They claim that it is much more challenging to model the background 
knowledge of adversaries and attacks about social network data than that 
about relational data. On relational data, they say, it is often assumed 
that a set of attributes serving as a quasi-identifier is used to associate data
from multiple tables, and attacks mainly come from identifying individuals 
from the quasi-identifier. However, in a social network many pieces of in- 
formation can be used to identify individuals, such as labels of vertices and 
edges, neighborhood graphs, induced subgraphs, and their combinations. 
It is much more complicated and much more difficult than the relational 
case. 

2. They also claim that it is much more challenging to measure the information
loss in anonymizing social network data than that in anonymizing
relational data.

3. Finally, they claim that it is much more difficult to anonymize a social 
network than data in table form, since changing labels of vertices and 
edges may affect the neighborhoods of other vertices, and removing or 
adding vertices and edges may affect other vertices and edges as well as 
the properties of the network. 

Observe that only the first point is relevant for the definition of k-anonymity,
since it focuses on the choice of quasi-identifier, at least if we consider this choice to
form part of the definition of k-anonymity for graphs. The second and the third
points are only important when defining an algorithm for k-anonymization of
graphs. In particular the third point is of a completely practical nature. Also,
it is important to realize that every kind of data has its peculiarities. Relational
table data can also hide unexpected quasi-identifiers, caused by the origin
of the data. Trajectory data can be tricky: when anonymizing car trajectories
it is important to check that the published trajectories are feasible. For example,
a car cannot drive over a lake. For graphs we have a similar situation. When
anonymizing a graph it is important to check that we do not produce edges
which contain only one vertex. In conclusion, the anonymization
algorithm must take into account the underlying structure of the data type that
is being anonymized.

Regarding the first point, we have seen in the previous discussion that all
available information about a graph can be deduced from its adjacency matrix.
This implies that if the matrix is k-anonymous, so is the graph, with respect to
any graph property that can come to mind.

2.5 n-Confusion for graphs

We will here make some remarks on n-confusion for graphs, although we leave
a more detailed discussion on this subject for future work.

The concept of n-confusion generalizes k-anonymity, and makes it possible to define
methods of anonymization that do not provide k-anonymity, but that do provide
the same level of anonymity as k-anonymity does. The main interest is then to
minimize information loss. Just as for any table data, n-confusion can be used
for privacy protection of data from social networks. We can separate data that
is representable in graph form from data that is not. Then a sketch for a family
of methods of anonymization could be the following:

1. Transform the graph data into a k-anonymous graph;

2. Observe that the k-anonymous graph provides a partition of the vertex
set (see Section 4), hence defining a clustering of the records. Now apply
a method of anonymization providing n-confusion with n = k to every
cluster independently.

It is not hard to see (remembering that k-anonymity is a special case of
n-confusion), that the result is a method of anonymization that provides
n-confusion with n = k. Suitable methods of anonymization could be the
following:

• Data distortion by probability distribution [12]. This is a special case
of synthetic data generation, in which the protected data is generated
from what is determined to be the probability distribution of the original
data. It is very important that the data generator provides non-reversible
anonymity (as it is supposed to do). Otherwise, the result will not satisfy
n-confusion. In particular, one should be careful with methods that
generate data with the same statistics as the original data, since too many
restrictions may lead to a determined system of equations, and
consequently, reversible anonymity. The exact characterization of data
generators that provide non-reversible anonymity remains to be determined. Observe
that although the new data asymptotically preserves statistics locally, if
no further actions are taken, not all global statistics will be preserved.
Also observe that if the clusters are small, then local statistics may not
be preserved.

The idea to combine a first step of clustering with a second step of data 
generation within the clusters, has previously been studied for non-graph 
data in [3] and [H], Section 4. 

• Data swapping. In this case, we suggest that all data should be uniformly
scrambled, attribute by attribute. Then it is clear that the result
satisfies n-confusion. Also, some of both the global and the local statistics
are preserved, since the data entries are the same as the original data
entries. Future work includes an evaluation of the information loss. Possible
relaxations could be discussed in order to lower the information loss.

3 Previous work



Previous applications of k-anonymity to graphs have suggested several different
quasi-identifiers, resulting in different definitions of k-anonymity for graphs.
These definitions compete, and no agreement has been reached. We review
some of the available definitions here below.

3.1 k-Anonymous graphs in terms of structural queries

Hay et al. [5] explore the potential of structural queries on graphs for the
reidentification of vertices and propose a formalization of the graph anonymization
problem based on k-anonymity. Given a graph G = (V, E) and an anonymization
of it, G' = (V', E'), they let an adversary post queries on the structure of
G in a neighborhood of a fixed vertex v ∈ V. The vertex sets V and V' are
assumed to be equal, so that if v is a vertex in the anonymized graph, then it is
also a vertex in the original graph. They define the candidate set of the vertex
x ∈ V with respect to the query Q as the set of vertices cand_Q(x) ⊆ V such
that the outcome of the query is the same for all vertices in cand_Q(x),

$$\mathrm{cand}_Q(x) = \{ v \in V : Q(x) = Q(v) \}.$$

As observed by Hay et al., the candidate sets with respect to a fixed query
form a partition of the vertex set V into equivalence classes. In their model
of the behavior of an adversary, he posts a sequence of structural queries. The
intersection of the results from this sequence of queries is compared with the
additional information the adversary has on the vertex he wants to reidentify
and may then provide a refinement of the reidentification of a vertex compared
to what a single query provides. A graph is then k-candidate anonymous if it
satisfies the following condition.

Definition 10. [5] Let Q be a structural query. An anonymized graph satisfies
k-candidate anonymity given Q if

$$\forall x \in V,\ \forall y \in \mathrm{cand}_Q(x) : C_{Q,x}[y] \le 1/k,$$

where C_{Q,x}[y] is the probability, given Q, of taking candidate y for x. The
authors define C_{Q,x}[y] = 1/|cand_Q(x)| for each y ∈ cand_Q(x) and 0 otherwise.
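
Since C_{Q,x}[y] = 1/|cand_Q(x)|, the condition of Definition 10 holds exactly when every candidate set has at least k elements. The following Python sketch illustrates this for a single query; the function names are hypothetical, the graph is stored as a dictionary from vertices to neighbor sets, and the degree is used as an example of a structural query.

```python
def candidate_set(vertices, query, x):
    """cand_Q(x): all vertices with the same query outcome as x."""
    return {v for v in vertices if query(v) == query(x)}


def is_k_candidate_anonymous(vertices, query, k):
    """True if 1/|cand_Q(x)| <= 1/k, i.e. |cand_Q(x)| >= k, for every x."""
    return all(len(candidate_set(vertices, query, x)) >= k for x in vertices)


# Path graph a - b - c - d, stored as vertex -> set of neighbors.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}


def degree(v):
    """Example structural query: the degree of the vertex v."""
    return len(adjacency[v])


print(sorted(candidate_set(adjacency, degree, "a")))   # ['a', 'd']
print(is_k_candidate_anonymous(adjacency, degree, 2))  # True
print(is_k_candidate_anonymous(adjacency, degree, 3))  # False
```

The candidate sets {a, d} and {b, c} partition the vertex set, in line with the observation that a fixed query induces equivalence classes.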

The anonymization method proposed in [5] is based on the idea of k-anonymity
as a partition of the record set. An algorithm is described that uses simulated
annealing in order to find a partition of the vertex set that satisfies the
k-anonymity constraint and maximizes the descriptive properties of the relations
between the classes of the partition. The algorithm returns a generalization of
the graph; a set of super-vertices, corresponding to the classes of the partition
of the vertices, connected by a set of super-edges (including super self-loops),
corresponding to the structural relations between the classes. If the partition
of the vertices is defined so that k-anonymity is obtained with respect to the
information contained in the super-edges, then the privacy constraint is indeed
satisfied. The data owner can choose to publish either the generalized graph or a
sampled graph with the same properties as the ones described by the generalized
graph.

The approach in [5] does not fix a unique quasi-identifier, but leaves it up to 
the data owner to choose which structural attributes are important to publish 
and/or protect. Among the previous work we have found, it is also probably 
the approach that is closest to the one presented in this article. 

3.2 k-Anonymous graphs with respect to the degree

If we choose the degree of the vertices as the quasi-identifier of the graph, then
we obtain the definition of k-anonymous graph proposed by Liu and Terzi [13].
The reason for their proposal seems to be of pragmatic nature. Without doubt
they are aware of the fact that the degree of the vertices is not the only attribute
that can be used as a quasi-identifier. However they explore the possibility to
anonymize the graph with respect to this sole attribute while making as few
changes in the graph as possible, using a greedy method.
Their definition of k-anonymity for graphs is as follows.

Definition 11. [13] A graph (V, E) is k-degree anonymous if every number 
that appears as a degree of a vertex in V appears as the degree of at least k 
vertices in V. 
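Definition 11 amounts to a counting condition on the degree sequence. A minimal sketch, assuming the graph is given as a dictionary from vertices to neighbor sets (the function name `is_k_degree_anonymous` is hypothetical):

```python
from collections import Counter


def is_k_degree_anonymous(adj, k):
    """True if every degree value that occurs is shared by at least k vertices."""
    degree_counts = Counter(len(neighbours) for neighbours in adj.values())
    return all(count >= k for count in degree_counts.values())


# A star with three leaves: the centre is the only vertex of degree 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
# A 4-cycle: all four vertices have degree 2.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

print(is_k_degree_anonymous(star, 2))
print(is_k_degree_anonymous(cycle, 4))
```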

3.3 k-Anonymous graphs with respect to isomorphic 1- 
neighborhoods 

If we instead consider that the quasi-identifier of the graph is the induced 
subgraph of the neighbors of a vertex, then we obtain the definition of k-anonymous 
graph proposed by Zhou and Pei [23]. Given a graph G = (V,E) and a vertex 
$v \in V$, the d-neighborhood $\mathrm{Neighbor}_G(v)$ is the induced subgraph of the set of 
vertices at distance d from v. For d = 1, the 1-neighborhood $\mathrm{Neighbor}_G(v)$ is 
the induced subgraph of the set of vertices that share an edge with v. 

Definition 12. [23] Let G = (V,E) be a graph. Then, the 1-neighborhood of 
$u \in V$ is the induced subgraph of the neighbors of u, denoted by 
$\mathrm{Neighbor}_G(u) = G(N(u))$, where $N(u) = \{v \mid (u,v) \in E\}$, and where $G(N(u))$ 
is defined with the vertices $N(u)$ and the edges 

\[ E_{N(u)} = \{(v, w) \mid (v, w) \in E \text{ and } v \in N(u) \text{ and } w \in N(u)\}. \]

A graph isomorphism between two graphs is a mapping that transforms one 
graph into the other by reindexing the vertices. When there exists a graph 
isomorphism between two graphs, then we say that the graphs are graph 
isomorphic. The definition of k-anonymity for graphs based on isomorphic 
1-neighborhoods is as follows. 

Definition 13. [23] Let G = (V, E) be a graph. The graph G' = (V', E') is 
a k-anonymization of G if $V \subseteq V'$, $E \subseteq E'$ and for every vertex $u \in V$ there 
are at least (k - 1) other vertices $v_1, \dots, v_{k-1} \in V$ such that $\mathrm{Neighbor}_{G'}(A(u))$, 
$\mathrm{Neighbor}_{G'}(A(v_1))$, ..., $\mathrm{Neighbor}_{G'}(A(v_{k-1}))$ are isomorphic. 

In [23] there is also an algorithm to accomplish k-anonymity according to 
Definition 13. 

3.4 (k,l)-Anonymous graphs with respect to subsets of 
neighborhoods 

Yet another version of k-anonymity for graphs has been proposed. In this 
definition, apart from the parameter k, an additional parameter l is used. The 
parameter k plays the same role as in k-anonymity. The definition, proposed 
by Feder et al. [5], is given below. 

Definition 14. A graph G = (V,E) is (k, l)-anonymous if for each vertex 
$v \in V$, there exists a set of vertices $U \subseteq V$ not containing v such that $|U| \geq k$ 
and for each $u \in U$ the two vertices u and v share at least l neighbors. 
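Definition 14 can be tested directly by counting, for each vertex, how many other vertices share at least l neighbors with it. The sketch below assumes a dictionary from vertices to neighbor sets and reads the size condition on U as "at least k"; the function name is hypothetical.

```python
def is_kl_anonymous_shared_neighbours(adj, k, l):
    """Check Definition 14: each vertex v has at least k other vertices
    that share at least l neighbours with v."""
    for v in adj:
        witnesses = [u for u in adj if u != v and len(adj[u] & adj[v]) >= l]
        if len(witnesses) < k:
            return False
    return True


# Complete bipartite graph K_{3,3} on the sides {0, 1, 2} and {3, 4, 5}.
k33 = {v: ({3, 4, 5} if v < 3 else {0, 1, 2}) for v in range(6)}
# Complete graph on five vertices.
k5 = {v: set(range(5)) - {v} for v in range(5)}

print(is_kl_anonymous_shared_neighbours(k33, 2, 3))
print(is_kl_anonymous_shared_neighbours(k5, 4, 3))
```

In K_{3,3} each vertex shares all three neighbors with the two other vertices on its side, and none with the vertices on the opposite side, which is why k = 2 is the largest value that works for l = 3.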

3.5 A criticism of (k,l)-anonymity 

With the following example we show that for any pair (k, l) with $k \geq l$ it is 
possible to find a graph that is (k, l)-anonymous, but in which reidentification is 
possible for a large proportion of the vertices, using only two of their neighbor 
vertices. 

Example 15. Let k, l, and m be three arbitrary integers such that 
$k \geq l \geq 1$ and $m \geq 2$. Define a graph G with vertices 
$V = \{v_0, \dots, v_{k-1}, u_0, \dots, u_{m-1}\}$ 
and let the edges E be defined as the union of the following sets of edges: 

(i) $(v_i, v_j)$ for all $v_i, v_j \in \{v_0, \dots, v_{k-1}\}$ 

(ii) $(u_i, u_{(i+1) \bmod m})$ and $(u_{(i+1) \bmod m}, u_i)$ for i in {0, ..., m - 1} 

(iii) for all $u_i \in \{u_0, \dots, u_{m-1}\}$, include $(u_i, v)$ and $(v, u_i)$ for all $v \in W_i$, where 
$W_i$ is a subset of $\{v_0, \dots, v_{k-1}\}$ of cardinality l. 

Then, this graph satisfies (k, l)-anonymity. In addition, it is easy to see that any 
vertex $u_i$ can be reidentified by the pair of vertices $u_{(i-1) \bmod m}$, $u_{(i+1) \bmod m}$. 

The k vertices $\{v_0, \dots, v_{k-1}\}$ in Example 15, which we denote by $V_1$, form 
a clique, that is, a subgraph which is complete. The m vertices $\{u_0, \dots, u_{m-1}\}$, 
which we denote by $V_2$, are only connected with two other vertices in $V_2$ and l 
vertices in the clique $V_1$. The graph has m + k vertices and 2k(k - 1) + 2m + 2ml 
edges. 

Observe that the parameter m can be as large as desired. Therefore, we can 
make the proportion of vertices that can be reidentified by only two neighbors 
close to 1. Also note that the selection of k and l is arbitrary. The only 
requirement is that $k \geq l$. In case of k < l, it is easy to see that an (l, l)-anonymous 
graph constructed as in the example above will satisfy (k, l)-anonymity. 

Figure 1: Example of a graph satisfying (k = 8, l = 3)-anonymity but where 
information about two neighbours of any of the m = 12 nodes in the border of 
the graph reidentifies it 

Example 16. Figure 1 illustrates the construction of a (k = 8, l = 3)-anonymous 
graph with m = 12 vertices in the border following the construction of Example 15. 



4 A definition of k-anonymity for graphs 

In this section we will present the definition of k-anonymity which we consider 
to be the appropriate one for graphs. 

Let (V, E) be a graph and let v be a vertex in V. Define the neighbors of v 
as the set of vertices of distance one to v, that is, the set of vertices 

\[ N(v) := \{u \in V : (u, v) \in E\}. \]

We give the following definition of k-anonymous graph. 

Definition 17. Let G = (V,E) be a graph. We say that G is k-anonymous if 
for any vertex $v_1$ in V, there are at least k distinct vertices $\{v_i\}_{i=1}^{k}$ in V, such 
that $N(v_i) = N(v_1)$ for all $i \in [1, k]$. 

This definition of k-anonymity is appropriate for a data owner that has 
determined the quasi-identifier of the graph to be the sets of neighbors of the vertices. 
A graph that is k-anonymous following this definition has an adjacency matrix 
in which every row vector appears at least k times. The adjacency matrix is a 
lossless representation of the graph, in the sense that it completely determines 
the graph. Therefore it can be deduced that a graph that is k-anonymous 
following Definition 17 is k-anonymous with respect to any other graph property, 
since these properties are implicitly present in the matrix. In such a graph, 
there is a partition of the set of vertices, such that the vertices in the same part 
share exactly the same neighbors. Observe that they do not only share their 
neighbor set, but also their non-neighbor set. 
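Since Definition 17 only asks that every neighbor set occurs at least k times, the largest admissible k of a given graph is the size of the smallest class of vertices with equal neighbor sets. A minimal sketch, assuming a non-empty graph stored as a dictionary from vertices to neighbor sets (the name `anonymity_level` is hypothetical):

```python
from collections import Counter


def anonymity_level(adj):
    """Largest k such that the graph is k-anonymous in the sense of Definition 17.

    Vertices are grouped by their neighbour set, which corresponds to grouping
    equal rows of the adjacency matrix; the smallest group determines k.
    """
    class_sizes = Counter(frozenset(neighbours) for neighbours in adj.values())
    return min(class_sizes.values())


# K_{3,3}: the two sides are the two classes, each of size 3.
k33 = {v: ({3, 4, 5} if v < 3 else {0, 1, 2}) for v in range(6)}
# Path 0 - 1 - 2: the middle vertex has a neighbour set of its own.
path = {0: {1}, 1: {0, 2}, 2: {1}}

print(anonymity_level(k33))
print(anonymity_level(path))
```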

4.1 Characterizing the k-anonymous graphs 

In this section we study the characterization of k-anonymous graphs. We present 
two propositions that establish conditions for these types of graphs. First, we 
consider regular k-anonymous graphs and then non-regular ones. 

4.1.1 Regular k-anonymous graphs 

Let G = (V, E) be a regular graph of degree d that is k-anonymous. The subsets 
of vertices that share their neighbors form a partition of the vertex set of the 
graph, 

\[ V = V_1 \cup \dots \cup V_n \]

such that for $i \neq j$ we have $N(v) = N(u)$ for all $v, u \in V_i$ and $N(v) \neq N(w)$ 
for all $v \in V_i$ and $w \in V_j$. However, note that in general it is not true that 
$N(v) \cap N(w) = \emptyset$ for $v \in V_i$ and $w \in V_j$ when $i \neq j$. 

Choose a vertex $v_1 \in V$ and let $V_1 = \{v_1, \dots, v_n\}$ consist of $v_1$ together with the 
other vertices in V such that $N := N(v_1) = N(v_i)$ for all $i \in [2, n]$. Because G is 
k-anonymous, n is at least k. Let $\{u_j\}_{j=1}^{d}$ be the d vertices in N. For any 
$j \in [1, d]$, we then have that $v_i$ belongs to $N(u_j)$ for all $i \in [1, k]$, and $N(u_j)$ has 
cardinality d, so that $k \leq d$. 

Fix $u_{j_0}$ for some $j_0 \in [1, d]$. Then there are vertices $\{w_s\}_{s=2}^{k}$ in V such that 
$N(u_{j_0}) = N(w_s)$ for all $s \in [2, k]$. In particular, $v_i$ belongs to $N(w_s)$ for all 
$i \in [1, k]$ and all $s \in [2, k]$, so that $w_s \in N(v_i)$ and therefore $w_s \in \{u_j\}_{j=1}^{d} = N$ 
for all $s \in [2, k]$. This implies that any equivalence class $V_i$ of vertices sharing 
neighbors that contains one of the vertices $u_j$ is contained in the set of vertices 
$N = \{u_j\}_{j=1}^{d}$. Therefore there is a partition of N into one or several equivalence 
classes. The classes of this partition form a subset of the classes of the partition of 
the whole vertex set. 

If in addition we assume that all the equivalence classes of the partition of 
the vertex set have the same cardinality $|V_i| = k$, then we can deduce that k 
divides d. In this case, k of course also divides the order |V| of the graph. 

Observe that in general the equivalence classes do not need to have the same 
cardinality. The only requirement is that the cardinalities should all be at least k 
and that the cardinalities of the equivalence classes in a neighbor set N sum 
up to d. One can for example construct a regular graph of degree 7 that is 
3-anonymous, by splitting the neighbor sets into equivalence classes of cardinality 
3 and 4. 
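One concrete way to obtain such a graph is sketched below. It is only an illustration under our own assumptions, not a construction taken from the text: eight classes of sizes 3, 3, 4, 4, 3, 3, 4, 4 are arranged in a cycle, and two vertices are joined exactly when their classes are adjacent in the cycle. Each class then has one neighboring class of size 3 and one of size 4. The helper `blow_up` is a hypothetical name.

```python
from collections import Counter


def blow_up(class_sizes, class_edges):
    """Replace node c of a small 'class graph' by class_sizes[c] vertices.

    Two vertices are adjacent exactly when their classes are joined in
    class_edges, so all vertices of one class share the same neighbour set.
    """
    members, next_id = {}, 0
    for c, size in enumerate(class_sizes):
        members[c] = list(range(next_id, next_id + size))
        next_id += size
    adj = {v: set() for v in range(next_id)}
    for a, b in class_edges:
        for u in members[a]:
            for v in members[b]:
                adj[u].add(v)
                adj[v].add(u)
    return adj


sizes = [3, 3, 4, 4, 3, 3, 4, 4]
edges = [(c, (c + 1) % 8) for c in range(8)]  # the classes form an 8-cycle
graph = blow_up(sizes, edges)

degrees = {len(neighbours) for neighbours in graph.values()}
class_sizes = Counter(frozenset(neighbours) for neighbours in graph.values())

print(degrees)                       # the set of degrees that occur
print(sorted(class_sizes.values()))  # sizes of the classes of equal neighbour sets
```

The graph has 28 vertices, every vertex has degree 3 + 4 = 7, and the smallest class has 3 vertices, so the graph is 3-anonymous according to Definition 17.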

We collect the results in this section in the following Proposition 18. 

Proposition 18. Let G = (V, E) be a d-regular, k-anonymous graph according 
to Definition 17. Then the following conditions are satisfied. 

• $k \leq d$; 

• There is a partition $V = V_1 \cup \dots \cup V_n$ such that for $i \neq j$ we have 
$N(v) = N(u)$ for all $v, u \in V_i$ and $N(v) \neq N(w)$ for all $v \in V_i$ and 
$w \in V_j$. However, in general $N(v) \cap N(w) \neq \emptyset$ for $v \in V_i$ and $w \in V_j$; 

• Fix a vertex $v \in V$. Then there is a partition of $N(v)$ into one or several 
equivalence classes. The classes of this partition form a subset of the 
classes of the partition of the whole vertex set; 

• If all the equivalence classes $V_i$ have the same cardinality $|V_i| = k$, then k 
divides d and |V|. 

4.1.2 Non-regular k-anonymous graphs 

In non-regular k-anonymous graphs, the vertex set of the graph is also 
partitioned into classes of vertices that share neighbors. By definition, the cardinality 
of these classes will be at least k. 

Many of the arguments we have made for regular k-anonymous graphs are 
true also for non-regular k-anonymous graphs. Let N be the neighbor set of 
one class $V_i$ of vertices. Then the vertices in $V_i$ are neighbors of the vertices in N. 
In a connected graph all vertices have neighbors, and therefore the minimum degree 
of a connected k-anonymous graph must be at least k. 

We also see that, by the same argument as in the regular case, for a fixed vertex 
$u \in N$, the class $V_j$ to which it belongs must contain only vertices in N. 
Therefore, also in this case the neighbor sets of the graph are partitioned into 
equivalence classes of vertices that share the same neighbors. 

As commented previously, when k is large compared to the order of the graph 
|V|, then there are only a few graphs of that order that are k-anonymous. This 
implies that if we protect a graph of small order so that it becomes k-anonymous 
for a large k, then the information loss is substantial. However, if k is small and 
the order of the graph is large, then much information is still kept. 

The following Proposition 19 collects the results presented in this section 
and is a generalization of Proposition 18. 

Proposition 19. Let G = (V, E) be a k-anonymous graph according to 
Definition 17. Then the following conditions are satisfied. 

• If G is connected, then the minimum degree of G is at least k; 

• There is a partition $V = V_1 \cup \dots \cup V_n$ such that for $i \neq j$ we have 
$N(v) = N(u)$ for all $v, u \in V_i$ and $N(v) \neq N(w)$ for all $v \in V_i$ and 
$w \in V_j$. However, in general $N(v) \cap N(w) \neq \emptyset$ for $v \in V_i$ and $w \in V_j$; 

• Fix a vertex $v \in V$. Then there is a partition of $N(v)$ into one or several 
equivalence classes. The classes of this partition form a subset of the 
classes of the partition of the whole vertex set. 






5 A definition of (k,l)-anonymity for graphs as 
a relaxation of k-anonymity for graphs 

The definition we just presented has the problem that it is sometimes rather 
restrictive, in particular for small data sets. Observe that if k is large in relation to 
the order |V| of the graph, then there is only a small number of non-isomorphic 
graphs that will satisfy the criterion of k-anonymity. Under these circumstances, 
the usefulness of the anonymized graph is therefore limited. This fact justifies 
the following relaxation of the definition of k-anonymity for graphs. Following 
the idea in [8], we introduce a second parameter l, and consider that the graph 
is (k, l)-anonymous if it is k-anonymous with respect to any subset of cardinality 
at most l of the neighbor sets of the vertices of the graph. The phrase "a subset 
of cardinality at most l of the neighbor sets of vertices" has two distinct 
interpretations, resulting in two distinct definitions of (k, l)-anonymity for graphs. 
Which of the two definitions should be used depends on the context. 

If the graph has no multiple edges (a pair of vertices can be connected by at 
most one edge), then the row vector in the adjacency matrix that represents the 
neighbor set N(v) of a vertex v is a vector in the space $\{0, 1\}^n$, where |V| = n 
is the number of vertices in the graph. If the graph has multiple edges, then the 
space is $(\mathbb{N} \cup \{0\})^n$. Interpret a subset of cardinality at most l of the neighbor 
set of v to be the entries of the vector N(v) which are indexed by a subset of 
the indices of N(v) of cardinality at most l. In this way we characterize the 
vertex v by subsets of both its neighbors and its non-neighbors and we give the 
following definition of (k, l)-anonymity for graphs. 

Definition 20 ((k, l)-anonymity for graphs (I)). Let G = (V,E) be a graph. 
We say that G is (k, l)-anonymous if for any vertex $v_1 \in V$ and for every subset of 
indices $I \subseteq [1, |N(v_1)|]$ of cardinality $|I| \leq l$ there are at least k distinct vertices 
$\{v_i\}_{i=1}^{k}$ such that $N(v_i)$ and $N(v_1)$ coincide over I for $i \in [1, k]$. 

In a graph that satisfies Definition 20, an adversary who fixes a vertex v 
for reidentification and who has access to the induced subgraph on a subset of 
vertices of the graph of cardinality at most l as auxiliary information, will only 
be able to reidentify v with probability at most 1/k. 
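For small graphs, Definition 20 can be verified by brute force over index sets. The following sketch assumes a simple graph stored as a dictionary from sortable vertex labels to neighbor sets, and it counts the vertex itself among the k matching vertices, as the indexing from i = 1 in the definition suggests. The function names are hypothetical, and the running time grows quickly with l.

```python
from itertools import combinations


def adjacency_rows(adj):
    """Row vectors of the adjacency matrix, indexed in sorted vertex order."""
    order = sorted(adj)
    return {v: tuple(1 if u in adj[v] else 0 for u in order) for v in order}


def is_kl_anonymous_def20(adj, k, l):
    """Brute-force check of Definition 20 (rows compared on index sets)."""
    rows = adjacency_rows(adj)
    n = len(rows)
    # Agreement on an index set implies agreement on each of its subsets,
    # so it is enough to test index sets of the largest allowed size.
    for index_set in combinations(range(n), min(l, n)):
        for v in rows:
            matches = sum(
                1 for u in rows
                if all(rows[u][i] == rows[v][i] for i in index_set)
            )
            if matches < k:
                return False
    return True


k33 = {v: ({3, 4, 5} if v < 3 else {0, 1, 2}) for v in range(6)}
print(is_kl_anonymous_def20(k33, 3, 6))
print(is_kl_anonymous_def20(k33, 4, 1))
```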

In contrast to Definition 20, we could interpret a subset of cardinality at 
most l of the neighbor set of v to be a subset of the neighbors of v. In this way 
the non-neighbors are not taken into account. Using this interpretation we give 
the following formal definition of (k, l)-anonymity. 

Definition 21 ((k, l)-anonymity for graphs (II)). Let G = (V,E) be a graph. 
We say that G is (k, l)-anonymous if for any vertex $v_1 \in V$ and for every subset 
$S \subseteq N(v_1)$ of cardinality $|S| \leq l$ there are at least k distinct vertices $\{v_i\}_{i=1}^{k}$ 
such that $S \subseteq N(v_i)$ for $i \in [1, k]$. 

In a graph that satisfies Definition 21, an adversary who fixes a vertex v 
for reidentification and who has access to the induced subgraph of a subset of 
vertices of the graph that contains at most l of the neighbors of v as auxiliary 
information, will only be able to reidentify v with probability at most 1/k. 
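Definition 21 can be checked in the same brute-force manner, now over subsets of neighbors instead of index sets. The sketch assumes a dictionary from sortable vertex labels to neighbor sets and counts the vertex itself among the k vertices; the function name is hypothetical.

```python
from itertools import combinations


def is_kl_anonymous_def21(adj, k, l):
    """Brute-force check of Definition 21 (subsets of neighbours only)."""
    for v, neighbours in adj.items():
        # A vertex containing S as neighbours also contains every subset of S,
        # so only subsets of the largest allowed size need to be tested.
        size = min(l, len(neighbours))
        for subset in combinations(sorted(neighbours), size):
            s = set(subset)
            holders = sum(1 for u in adj if s <= adj[u])
            if holders < k:
                return False
    return True


# Complete graph on five vertices without loops.
k5 = {v: set(range(5)) - {v} for v in range(5)}
k33 = {v: ({3, 4, 5} if v < 3 else {0, 1, 2}) for v in range(6)}

print(is_kl_anonymous_def21(k5, 2, 3))
print(is_kl_anonymous_def21(k33, 3, 3))
```

In the complete graph on five vertices, a set S of three neighbors is contained in the neighbor sets of exactly the two vertices outside S, so the graph is (2, 3)-anonymous but not (2, 4)-anonymous under this reading.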

Observe that the fact that Definition 20 and Definition 21 are relaxations 
of Definition 17 implies that a graph that satisfies (k, l)-anonymity is not in 
general k-anonymous, and that one could even find examples of (k, l)-anonymous 
graphs in which every vertex is uniquely identified by some property, say, by 
its degree. 

This observation means that in a situation where the data owner considers 
that there is an elevated risk that the adversary could have access to some 
auxiliary information besides a subgraph containing at most l neighbors of any 
vertex, further protection is recommended. For example, in the case when 
the additional auxiliary information consists of the degrees of the vertices of 
the graph, the data owner could consider a graph protection method which 
combines (k, l)-anonymization and k-anonymization with respect to the degree. 
The latter method can be found in [15]. 

However, whenever the auxiliary information about the graph that is 
available to the adversary is restricted to 

• the induced subgraph of the original graph on at most l vertices, in the 
case of Definition 20, or 

• the induced subgraph of the original graph on a subset of the original 
vertices that contains at most l neighbors of any of the original vertices, 
in the case of Definition 21, 

then the information that he has about the degrees of the vertices is equally 
restricted, so that the data protection in a (k, l)-anonymous graph is just as 
high as it claims to be. 

It is obvious from the definition that k cannot be larger than the order |V| 
of the graph. If we assume that the graph contains no loops, then the set of k 
vertices that share neighbors and the set of l neighbors that they share must be 
disjoint, so that l cannot be larger than |V| - k. These bounds are attained, 
since the complete graph (V,E) on n := |V| vertices is (k, n - k)-anonymous 
for all k < n. 

Consider a graph that is (k, l)-anonymous according to Definition 21. Let 
d be the minimum degree of the graph. If d is smaller than l, so that there 
is a vertex v with fewer than l neighbors, then v can share at 
most d neighbors with other vertices, so that the graph can be at most 
(k, d)-anonymous. Therefore, for (k, l)-anonymity according to Definition 21 it should 
always be assumed that $l \leq d$. 

Observe that if a vertex shares a set of l neighbors with k other vertices, 
then it also shares any subset of these l neighbors with the same k vertices. 
Therefore, if a graph is (k, l)-anonymous according to Definition 21, then it is 
also (k, l')-anonymous according to Definition 21 for all $l' \leq l$. 

Also, if a vertex shares a set of l neighbors with k other vertices, then it also 
shares the same set of l neighbors with any subset of the k vertices. Therefore, 
if a graph is (k, l)-anonymous according to Definition 21, then it is also 
(k', l)-anonymous according to Definition 21 for all $k' \leq k$. 
We collect these results in the following Proposition 22. 

Proposition 22. Let G = (V,E) be a (k, l)-anonymous graph, following 
Definition 21. Then the following conditions are satisfied. 

• If G has no loops, then $k + l \leq |V|$; 

• l is smaller than or equal to the minimum degree of the graph; 

• G is (k, l')-anonymous for all $l' \leq l$; 

• G is (k', l)-anonymous for all $k' \leq k$. 

6 Algorithms for the k-anonymization of graphs 

In this section we will present three different algorithms. The first is an 
algorithm for k-anonymization of databases. The second algorithm determines the 
degree of (k, l)-anonymity of a given graph, that is, given a k it determines the 
largest l for which the graph is (k, l)-anonymous. The third algorithm increases 
the degree of (k, l)-anonymity of a graph. More precisely, if the algorithm is 
given a graph that is (k, l)-anonymous, then it returns a similar graph that is 
(k, l')-anonymous, with l' > l. 

6.1 A k-anonymization algorithm 

As commented in Section 4, it is easy to see that in a graph that is k-anonymous 
according to Definition 17 there exists a partition of the vertex set in classes of 
cardinality at least k, so that the vertices in a class of the partition all share the 
same neighbors. This partition is easy to find. The row vectors (or, equivalently, 
the column vectors) of the adjacency matrix of the graph represent the neighbor 
set of their corresponding vertex, so that the vertices in the same class of a 
k-anonymous graph will have equal row vectors (and column vectors). Since every 
class contains at least k vertices, the adjacency matrix of a k-anonymous graph 
is a table that satisfies k-anonymity. 

Now suppose that we have a graph with an adjacency matrix A that does 
not satisfy k-anonymity and that we want to transform A in order to obtain 
another table A' that is similar to A, satisfies k-anonymity and is the adjacency 
matrix of a graph. That is, suppose that we want to define a method for 
k-anonymization of graphs. In this section we give an algorithm (Algorithm 1) 
that describes such a method. 

The algorithm is based on a clustering algorithm for graphs. We require that 
the clustering algorithm returns a partition of the vertex set V of the graph and 
that each cluster or class of vertices contains at least k vertices. In [TU], the 
authors use simulated annealing in order to find a partition of the vertices that 
satisfies k-anonymity and minimizes information loss, via a maximum likelihood 
approach. Heuristic methods are attractive because they work in practice. However, 
other methods may offer more theoretical control over the properties of the output of 
the algorithm. 

In order to obtain good clustering results the choice of the distance to use 
is crucial. Not only classical distances are used for clustering, but also weaker 
topology concepts, like similarities [TJ []3J and proximity relations [TT]. The 
paper [18] describes some available algorithms for clustering of graphs, and dis- 
cusses how to define similarities between vertices. In particular, two similarities 
for clustering of the neighbor sets of the vertices of a graph are compared; the 
Manhattan similarity, based on the Manhattan distance, and the so-called 2- 
path similarity. The Manhattan similarity measures how many equal entries 
the two vertices have in the adjacency matrix. Formally, we have the following 
definition. 

Definition 23. Given two vectors in the adjacency matrix of G, $u, v \in \{0, 1\}^{|V|}$, 
we denote by $\mathrm{sim}_{l_1}(u, v)$ the Manhattan or $l_1$ similarity between u and v, so that 

\[ \mathrm{sim}_{l_1}(u, v) = |V| - \sum_{i=1}^{n} |u[i] - v[i]|, \]

where n = |V|. 
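As a small numerical illustration of Definition 23, the sketch below evaluates the Manhattan similarity on the adjacency rows of the path 0 - 1 - 2. The function name `manhattan_similarity` is hypothetical, and rows are assumed to be 0/1 tuples of equal length.

```python
def manhattan_similarity(row_u, row_v):
    """|V| minus the l1 distance between two adjacency rows."""
    return len(row_u) - sum(abs(a - b) for a, b in zip(row_u, row_v))


# Adjacency rows of the path 0 - 1 - 2.
rows = {
    0: (0, 1, 0),
    1: (1, 0, 1),
    2: (0, 1, 0),
}

# The two end vertices have identical rows, so their similarity is |V| = 3.
print(manhattan_similarity(rows[0], rows[2]))
# An end vertex and the middle vertex differ in every entry.
print(manhattan_similarity(rows[0], rows[1]))
```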

The 2-path similarity measures the number of paths of length 2 between 
the two vertices. This can be calculated by taking the square of the adjacency 
matrix, and we define the 2-path similarity in this manner. 

Definition 24. Given two vectors in the adjacency matrix of G, $u, v \in \{0, 1\}^{|V|}$, 
we denote by $\mathrm{sim}_{2\text{-path}}(u, v)$ the 2-path similarity between u and v, so that 

\[ \mathrm{sim}_{2\text{-path}}(u, v) = \sum_{i=1}^{n} u[i] \, v[i], \]

where n = |V|. 
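The relation between Definition 24 and the square of the adjacency matrix can be verified on a small example. The sketch below assumes an undirected simple graph, so that the adjacency matrix is symmetric; the helper names are hypothetical.

```python
def two_path_similarity(row_u, row_v):
    """Number of common neighbours, computed from two 0/1 adjacency rows."""
    return sum(a * b for a, b in zip(row_u, row_v))


def matrix_square(matrix):
    """Plain-Python square of a square matrix given as a list of rows."""
    n = len(matrix)
    return [
        [sum(matrix[i][t] * matrix[t][j] for t in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Adjacency matrix of the 4-cycle 0 - 1 - 2 - 3 - 0.
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
A2 = matrix_square(A)

# Entry (i, j) of the squared matrix equals the 2-path similarity of rows i and j.
agree = all(
    A2[i][j] == two_path_similarity(A[i], A[j])
    for i in range(4) for j in range(4)
)
print(agree)
print(A2[0][2])  # opposite vertices of the cycle
```

Note that the diagonal entries of the squared matrix give the degree of each vertex, which is the 2-path similarity of a vertex with itself.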

It is interesting to note how two vertices that share many neighbors have 
many paths of length two between them, or, expressed in another way, they 
have many quadrangles that pass through them. As explained in [18], the 
Manhattan similarity measures the similarity between vertices with respect to 
both neighbors and non-neighbors, while the 2-path similarity only measures 
the similarity between vertices with respect to their neighbors, so that a 
common non-neighbor does not change the similarity between two vertices. The 
differences between these two similarities should be compared with the 
differences between the two definitions of (k, l)-anonymity in Definition 20 and 
Definition 21. Algorithm 2 and Algorithm 3 illustrate the connection between the 
similarities and the definitions of (k, l)-anonymity. 

A clustered graph can be published as a generalized graph as described in 
[5]. When the graph is k-anonymous, it is possible to publish a generalization of 
the graph that satisfies k-anonymity, without any information loss at all. This 
is indeed the idea behind k-anonymity for graphs. 

However, when the graph is not k-anonymous, a generalization of the graph 
that satisfies k-anonymity is never lossless. We want a method to transform 
any graph into a protected graph that satisfies k-anonymity according to 
Definition 17. We also want the protected graph to be similar to the original graph, 
so that the information loss is kept small. 

Input: A graph G = (V, E) and a natural number $k \leq |V|$ 
Output: A graph G' = (V, E') that is k-anonymous according to 
Definition 17 

Calculate the matrix $S = (s_{ij})_{i,j=1}^{|V|}$ with $s_{ij} := \mathrm{sim}(v_i, v_j)$ for $v_i, v_j \in V$; 
Partition the rows of S in clusters, hence obtaining a family of clusters C 
of V; 
foreach $C_i, C_j \in C$ do 
    if $\#\{(u, v) \in C_i \times C_j : uv \in E\} > |C_i||C_j|/2$ then 
        foreach $(u, v) \in C_i \times C_j$ do 
            if $uv \notin E$ then 
                Add uv to E; 
            end 
        end 
    else 
        foreach $(u, v) \in C_i \times C_j$ do 
            if $uv \in E$ then 
                Delete uv from E; 
            end 
        end 
    end 
end 
Return G; 

Algorithm 1: An algorithm for k-anonymization of graphs using clustering 
and plurality rule 
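The plurality step of Algorithm 1 can be sketched in a few lines. The sketch below is an illustration under our own assumptions rather than a reference implementation: the clustering is taken as given (a list of disjoint vertex lists, each of size at least k), the graph is undirected and stored as a dictionary from vertices to neighbor sets, and a vertex that appears in its own neighbor set represents a self-loop. The function name `anonymize_by_plurality` is hypothetical.

```python
def anonymize_by_plurality(adj, clusters):
    """Apply the plurality rule of Algorithm 1 to every pair of clusters.

    adj is assumed symmetric, so each unordered pair of clusters (including a
    cluster paired with itself) is processed once. If more than half of the
    ordered pairs in C_i x C_j are edges, the two clusters become completely
    connected; otherwise all edges between them are removed.
    """
    new_adj = {v: set() for v in adj}  # start empty, then add the kept edges
    for i, ci in enumerate(clusters):
        for cj in clusters[i:]:
            present = sum(1 for u in ci for v in cj if v in adj[u])
            if 2 * present > len(ci) * len(cj):
                for u in ci:
                    for v in cj:
                        new_adj[u].add(v)
                        new_adj[v].add(u)
    return new_adj


# A graph on six vertices and a given clustering into two classes of size 3.
graph = {
    0: {1, 3, 4, 5},
    1: {0, 3, 4},
    2: {3},
    3: {0, 1, 2},
    4: {0, 1},
    5: {0},
}
clusters = [[0, 1, 2], [3, 4, 5]]
protected = anonymize_by_plurality(graph, clusters)
print(protected)
```

In this example six of the nine pairs between the two clusters are edges, so the clusters become completely connected, while the single edge inside the first cluster is removed. All vertices of a cluster end up with the same neighbor set, as Definition 17 requires. When a cluster is paired with itself and the majority rule adds edges, the product C_i x C_i also contains the pairs (u, u), so self-loops are added; this follows a literal reading of the pseudocode.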

6.2 Algorithms for the calculation of the degree of (k,l)- 
anonymity of a graph, given k 

If we calculate the Manhattan similarity between all the vertices in the graph, 
then we can determine the highest l such that the graph is (k, l)-anonymous 
according to Definition 20, for k fixed. This l defines neighborhoods or balls 
around each vertex v: sets of vertices {u} that satisfy $\mathrm{sim}_{l_1}(u, v) \geq l$, with at 
least k vertices in each neighborhood. When k = 1, the largest l that 
gives us this family of neighborhoods is trivially equal to the order of the graph, 
l = |V|. When k > 1, l might be smaller, somewhere between 0 and the 
order of the graph. The cardinality of the set of neighborhoods is |V| and it 
forms a fuzzy clustering of the neighbor sets of the vertices of the graph. The 
centroids of these clusters are the points of $(\mathbb{Z}/(2))^{|V|}$ that represent the neighbor 
sets, and since every cluster has cardinality at least k, it is obvious that whenever 
k > 1 the clusters overlap. 

Input: A graph G = (V, E) and a natural number $k \leq |V|$ 
Output: The largest l such that G is (k, l)-anonymous according to 
Definition 20 

s := |V|; 
while Exists $v \in V$ such that $\#\{u \in V : \mathrm{sim}_{l_1}(u, v) \geq s\} < k$ do 
    s--; 
end 
Return l := s; 

Algorithm 2: An algorithm that given G and k computes the largest l for 
which G is (k, l)-anonymous, using the Manhattan similarity 

Input: A graph G = (V, E) and a natural number $k \leq |V|$ 
Output: The largest l such that G is (k, l)-anonymous according to 
Definition 21 

foreach $v_i \in V$ do 
    s[i] := Degree($v_i$); 
end 
while Exists $v_i \in V$ such that $\#\{u \in V : \mathrm{sim}_{2\text{-path}}(u, v_i) \geq s[i]\} < k$ do 
    s[i]--; 
end 
Return l := $\min_i$(s[i]); 

Algorithm 3: An algorithm that given G and k computes the largest l for 
which G is (k, l)-anonymous, using the 2-path similarity 

If we instead use the 2-path similarity, then we can determine the largest l 
so that the graph is (k, l)-anonymous according to Definition 21. For k = 1, the 
largest l is equal to the minimum degree of the graph, and if k > 1, then the 
largest possible l is somewhere between 0 and the minimum degree. 

In this way we obtain two different measures of the degree of anonymity 
of the original graph: the largest parameters so that the graph satisfies the 
definitions of the two versions of (k, l)-anonymity. Which of the two measures 
is the most useful depends on the context, or more precisely, on whether 
non-neighbors are useful for reidentification or not. 

Hence, we present here an algorithm (Algorithm 2) that, given a graph 
G = (V,E) and a positive integer k, calculates the largest l such that G is 
(k, l)-anonymous according to Definition 20. The algorithm shows the relation 
between the Manhattan or $l_1$ similarity [18] and the (k, l)-anonymity according 
to Definition 20 described before. 

Next we present an algorithm (Algorithm 3) that, given a graph G = (V, E) 
and a positive integer k, calculates the largest l such that G is (k, l)-anonymous 
according to Definition 21. The algorithm shows the relation between the 2-path 
similarity [18] and the (k, l)-anonymity according to Definition 21 as described 
before. 
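The two threshold searches of Algorithm 2 and Algorithm 3 can be written compactly. The following sketch follows the pseudocode step by step and assumes a simple undirected graph stored as a dictionary from vertices to neighbor sets; the similarities are computed from the neighbor sets instead of explicit matrix rows, which gives the same values for 0/1 rows. All function names are hypothetical.

```python
def manhattan_similarity(adj, u, v):
    # |V| minus the number of positions where the two adjacency rows differ.
    return len(adj) - len(adj[u] ^ adj[v])


def two_path_similarity(adj, u, v):
    # Number of common neighbours, i.e. paths of length two between u and v.
    return len(adj[u] & adj[v])


def largest_l_manhattan(adj, k):
    """Threshold search of Algorithm 2 (Manhattan similarity)."""
    if not 1 <= k <= len(adj):
        raise ValueError("k must satisfy 1 <= k <= |V|")
    s = len(adj)
    # Lower the common threshold until every vertex has at least k vertices
    # (itself included) whose similarity to it reaches the threshold.
    while any(
        sum(1 for u in adj if manhattan_similarity(adj, u, v) >= s) < k
        for v in adj
    ):
        s -= 1
    return s


def largest_l_two_path(adj, k):
    """Threshold search of Algorithm 3 (2-path similarity)."""
    if not 1 <= k <= len(adj):
        raise ValueError("k must satisfy 1 <= k <= |V|")
    thresholds = {v: len(adj[v]) for v in adj}  # start at the degree of v
    for v in adj:
        while sum(
            1 for u in adj if two_path_similarity(adj, u, v) >= thresholds[v]
        ) < k:
            thresholds[v] -= 1
    return min(thresholds.values())


k33 = {v: ({3, 4, 5} if v < 3 else {0, 1, 2}) for v in range(6)}
print(largest_l_manhattan(k33, 3), largest_l_two_path(k33, 3))
print(largest_l_manhattan(k33, 4), largest_l_two_path(k33, 4))
```

Both loops terminate because at threshold 0 every vertex counts, and the guard on k ensures that |V| vertices are enough. For K_{3,3} the two sides have identical rows, so k = 3 gives the maximal values, while k = 4 forces both thresholds down to 0.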
Input: A graph G = (V, E) and a natural number $k \leq |V|$ 

Output: A graph G' = (V', E') such that if G is (k, l)-anonymous, then 
G' is (k, l')-anonymous with l' = l + 1 

Calculate the largest l such that G is (k, l)-anonymous, using Algorithm 2 
or Algorithm 3; 
l' := l + 1; 
while Exists $v \in V$ such that $\#\{u \in V : \mathrm{sim}(u, v) \geq l'\} < k$ do 
    foreach $u \in V$ do 
        if $uv \in E$ then 
            Delete uv from E; 
        end 
    end 
    Delete v from V; 
end 
Return G; 

Algorithm 4: An algorithm that given a (k, l)-anonymous graph G returns 
a (k, l + 1)-anonymous graph 

6.3 An algorithm to increase the degree of (k,l)-anonymity 
of a graph 

Finally we present an algorithm (Algorithm 4) that, given a graph G that is 
(k, l)-anonymous with respect to the similarity sim, returns either a graph G' 
that is based on G but that is (k, l')-anonymous with l' = l + 1, or the empty 
graph without vertices. 

The algorithm deletes vertices v that have a set of similar vertices $\{u \in V : 
\mathrm{sim}(u, v) \geq l'\}$ which is too small, that is, of cardinality smaller than k. Since 
the deletion of a vertex affects the neighbor sets of the other vertices v, so that 
$\#\{u \in V : \mathrm{sim}(u, v) \geq l'\}$ may decrease, causing the deletion of more vertices in the 
next execution of the loop, there is a risk that the algorithm deletes all vertices of the 
graph. In order to avoid this phenomenon, we recommend, for some $v \in V$ with 
$\#\{u \in V : \mathrm{sim}(u, v) \geq l'\} < k$, the addition of new vertices to V to augment 
$\#\{u \in V : \mathrm{sim}(u, v) \geq l'\}$. Such a new vertex v' must be connected to the already 
existing vertices in V in a way so that $\#\{u \in V : \mathrm{sim}(u, v') \geq l'\} \geq k$, 
so that we do not introduce new problematic vertices. An easy solution to 
this problem is to let the new vertices v' be copies of the problematic vertex v. 
However, the use of this solution would cause the algorithm to lose its status 
as an anonymization method, since the records in the resulting table would not 
correspond to distinct individuals, so that it fails to be a database according 
to our definition. As a consequence, the risk of reidentification for the vertices 
protected in this way will be higher than for the vertices in a graph that is 
protected using a real method of (k, l')-anonymization. 
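The deletion loop of Algorithm 4, including the risk of ending with the empty graph, can be illustrated as follows. This is a minimal sketch under our own assumptions: the target threshold l' is passed in directly instead of being computed first, the 2-path similarity is used, and the graph is a dictionary from vertices to neighbor sets. The function names are hypothetical.

```python
def two_path_similarity(graph, u, v):
    # Number of common neighbours of u and v.
    return len(graph[u] & graph[v])


def raise_anonymity_by_deletion(adj, k, l_prime, similarity):
    """Delete vertices until every remaining vertex has at least k vertices
    (itself included) with similarity >= l_prime. May return an empty graph."""
    graph = {v: set(neighbours) for v, neighbours in adj.items()}  # work on a copy
    while True:
        problematic = next(
            (v for v in graph
             if sum(1 for u in graph if similarity(graph, u, v) >= l_prime) < k),
            None,
        )
        if problematic is None:
            return graph
        # Remove the edges incident to the problematic vertex, then the vertex.
        for u in list(graph[problematic]):
            graph[u].discard(problematic)
        del graph[problematic]


# K_{3,3} with one extra vertex 6 attached to vertex 0.
k33_plus = {
    0: {3, 4, 5, 6}, 1: {3, 4, 5}, 2: {3, 4, 5},
    3: {0, 1, 2}, 4: {0, 1, 2}, 5: {0, 1, 2},
    6: {0},
}
print(sorted(raise_anonymity_by_deletion(k33_plus, 3, 3, two_path_similarity)))

# On a short path the deletions cascade until nothing is left.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(raise_anonymity_by_deletion(path, 2, 1, two_path_similarity))
```

In the first example only the pendant vertex is removed and K_{3,3} remains. In the second, deleting the middle vertex makes both end vertices problematic in turn, which shows the cascade described above.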
7 Conclusion 



In this article we have provided a formal framework for reidentification in 
general. We have defined n-confusion as a concept for modelling the anonymity 
of a database table and we have proved that n-confusion is a generalization of 
k-anonymity. Then after a short survey on the different available definitions of 
k-anonymity for graphs we provided a new definition for k-anonymity, which we 
consider to be the correct definition. It has been explained how this definition 
can be used in combination with n-confusion, for the anonymization of data 
from, for example, social networks. 

We have provided a description of the k-anonymous graphs, both for the 
regular and the non-regular case. However, under some conditions our definition 
of k-anonymity is quite strict, so that it is only satisfied by a small number of 
graphs. In order to avoid this problem, we have introduced the more flexible 
definition of (k, l)-anonymity. Our definition of (k, l)-anonymity for graphs is 
meant to replace the definition in [8], which we have proved to have severe 
weaknesses. We have given two variants of the definition of (k, l)-anonymity, 
which may serve under different conditions. 

We have also provided a set of algorithms: one algorithm that given a graph 
G and a natural number k returns a graph G' based on G that is k-anonymous, 
two algorithms that given a graph G and a natural number k calculate the 
largest l such that G is (k, l)-anonymous according to our two different 
definitions of (k, l)-anonymity, and finally, one algorithm that given a graph G that 
satisfies (k, l)-anonymity returns a graph G' similar to G that satisfies 
(k, l + 1)-anonymity. 